The work reported in this document was performed at Lincoln Laboratory,
a center for research operated by Massachusetts Institute of Technology.
This work was sponsored by the Defense Advanced Research Projects
Agency under Air Force Contract F19628-78-C-0002 (ARPA Order 2006).

This report may be reproduced to satisfy needs of U.S. Government agencies.


The views and conclusions contained in this document are those of the
contractor and should not be interpreted as necessarily representing the
official policies, either expressed or implied, of the United States
Government.


This technical report has been reviewed and is approved for publication.

FOR  THE  COMMANDER 


Raymond L. Loiselle, Lt. Col., USAF
Chief, ESD Lincoln Laboratory Project Office


MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 


LINCOLN  LABORATORY 


HOMOMORPHIC PITCH DETECTION


D. B. PAUL
Group  24 


TECHNICAL NOTE


LEXINGTON 


MASSACHUSETTS 




coherence seeking pitch detector is presented.




Contents

Abstract
I. Introduction
II. The Speech Model
III. The Basic Homomorphic Pitch Detector
IV. Practical Aspects of Homomorphic Pitch Detection
V. The Implementation
VI. Results
VII. Discussion
VIII. Summary
IX. Appendix
X. Bibliography


I.  Introduction 


For the past few decades, there has been much interest in
speech bandwidth compression systems. The homomorphic vocoder
algorithm and its related pitch detection algorithm [5,6], both
proposed in the mid-to-late sixties, have undergone little
development, primarily due to the complex hardware required to
implement the Fourier transforms of the algorithm in real time.
(A real-time capability is required for any practical vocoder and
is extremely helpful for research versions.) This obstacle has
been removed by recent developments in charge-coupled device
(CCD) technology, which promise fast computation of the Fourier
transform with relatively simple hardware. The work presented
here was undertaken to improve the performance of the homomorphic
pitch detection algorithm in anticipation of its eventual
implementation using CCD chirp-z transform chips.

First, the human speech production model and the theory of
the homomorphic processor are described. Then, some practical
difficulties encountered by the basic homomorphic pitch detector
are covered, along with the details of the real-time
implementation used here. The final sections summarize and
analyze the results of the real-time tests of the algorithm. A
pilot study of a method for improving the performance of the
homomorphic pitch detector (or any other coherence seeking pitch
detector) is presented in the Appendix.

II. The Speech Model

The human speech production system consists of an air
pressure source (the lungs) feeding through the vocal cords and
combined nasal and oral passages. The vocal cords can be caused
to vibrate and provide a periodic excitation to the vocal tract.
Alternatively, the vocal cords can be abducted to allow airflow
into the tract. A constriction higher in the tract will then
cause turbulence noise to be generated just downstream of the
constriction. The oral and nasal cavities form a deformable set
of resonators of variable configuration (straight tube or tube
with sidebranch) connected to the excitation sources. A
simplified version of this model, a periodic pulse or white noise
source feeding a time-varying filter, is generally used for
vocoders derived from speech production models [2].

This simplified model (Figure 1) is adequate most of the
time for most speakers. Its description of the voiced excitation
breaks down in several situations. At the beginning or end of
voicing, the interval between glottal pulses can change very
rapidly. Any pitch detector which requires "local stationarity",
i.e., stationarity over a small window, is likely to find this
zone of speech to be unvoiced. Another troublesome mode of
voicing is diplophonia [4], in which alternate pitch periods are
more highly correlated than adjacent pitch periods. Not only
does this violate local stationarity, but it poses the additional
question of how the listener perceives such an excitation: with a
period equal to the average of the two adjacent pitch periods or
with a period equal to the sum of the two adjacent pitch periods.

Fig. 1. Speech generation model.

Many languages, including English, contain a class of
phonemes, the voiced fricatives, in which the excitation has both
periodic and white noise components. The binary source model is
clearly incapable of correctly describing such a phoneme. Some
African languages include sounds such as clicks which also are
not included within the model. The perceptual consequences of
such errors in the model vary with the specific error and the
language. If the analyzer errs in a forgiving way, many errors
will be corrected by the syntactic and semantic constraints of a
language.


III. The Basic Homomorphic Pitch Detector

From a signal viewpoint, the speech generation model is just
a periodic or white signal source feeding a filter:

$$ s(t) = h(t) * u(t) \tag{1} $$

where
    $s(t)$ = speech signal
    $h(t)$ = vocal tract filter
    $u(t)$ = excitation signal (periodic or white)
    $*$ = convolution operator

Time domain convolution results in a frequency domain product:

$$ S(f) = H(f)\,U(f) \tag{2} $$

If one takes the logarithm of the magnitude of both sides,

$$ \log|S(f)| = \log|H(f)| + \log|U(f)| \tag{3} $$

the product on the right hand side is transformed into a
summation of independent components. $H(f)$ for an adult male
speaker uttering a vowel contains, on the average, one formant
(resonance) per kilohertz. Therefore, $\log|H(f)|$ is a slowly
varying curve. If the utterance is unvoiced, $u(t)$ is white noise
and $\log|U(f)|$ is relatively flat. If, on the other hand, the
utterance is voiced, $u(t)$ and therefore $\log|U(f)|$ are periodic.
As the typical range of pitches is about 60 to 300 Hz, the
frequency domain periodicity is of relatively high frequency.
$\log|S(f)|$ is therefore the sum of a slowly varying component and
a rapidly varying periodic component.

A final Fourier transform of $\log|S(f)|$,

$$ c(t) \longleftrightarrow \log|S(f)| \tag{4} $$

produces the cepstrum (Figure 2) and will generally separate the
components along a time scale, with the low order components of
$c(t)$ representing $\log|H(f)|$ and, if voicing is present, a higher
order peak at the periodicity of $u(t)$ representing $\log|U(f)|$ [6].
This cepstrum is then searched along the zone corresponding to
the expected range of pitches for the height and position of the
pitch peak. If the peak height is above a threshold, the frame is
declared voiced with pitch period equal to the position of the peak.
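The procedure just described can be illustrated with a minimal sketch. It is not the report's implementation: the function names, the synthetic test frame, the search range, and the threshold of 0.1 are all assumptions chosen for illustration, and a plain O(N^2) transform stands in for an FFT so that only the standard library is needed.

```python
import cmath
import math


def dft(x):
    """Plain O(N^2) discrete Fourier transform (adequate for a short demo)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def cepstrum(frame):
    """Transform of the log magnitude spectrum of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [frame[t] * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t in range(n)]
    # Small constant guards against log(0); it is an assumption of this sketch.
    log_mag = [math.log(abs(v) + 1e-9) for v in dft(windowed)]
    # log|S| is real and even, so the real part of its transform suffices.
    return [v.real / n for v in dft(log_mag)]


def basic_pitch(frame, min_period, max_period, threshold):
    """Return the pitch period in samples, or 0 if the frame is judged unvoiced."""
    c = cepstrum(frame)
    peak = max(range(min_period, max_period + 1), key=lambda q: c[q])
    return peak if c[peak] > threshold else 0


# Synthetic "voiced" frame: an impulse train with a 40-sample period driving
# a one-pole filter that stands in for the vocal tract.
period, n = 40, 256
frame, y = [], 0.0
for t in range(n):
    y = 0.9 * y + (1.0 if t % period == 0 else 0.0)
    frame.append(y)

estimate = basic_pitch(frame, min_period=20, max_period=100, threshold=0.1)
```

The low order cepstral terms carry the filter response and are simply excluded here by starting the search at `min_period`; the voicing decision is the single fixed threshold of the basic detector.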

On an ideal speech sound, this homomorphic pitch detector
(Figure 3) works quite well. The peak representing the pitch
period is sharp and clear when the excitation is voiced and
absent when unvoiced. Real speech recorded with high quality
equipment in a quiet environment frequently produces a well
behaved cepstrum. It also strains the inherent assumptions
sufficiently often that the basic homomorphic pitch detector will
generally perform fairly poorly over an entire utterance. A
practical homomorphic pitch detector therefore requires a number
of modifications and additional strategies to give acceptable
performance.

Fig. 2. Waveforms of the basic homomorphic processor (log spectrum, 0-2.2 kHz).

IV. Practical Aspects of Homomorphic Pitch Detection

The first issue which a homomorphic pitch detector must face
is the lack of stationarity in the signal. To preserve the local
stationarity that exists, the speech signal must be windowed and
analyzed as (hopefully) stationary frames. As some events in
(English) speech (flap d and flap t) occur in about 10 ms, a
very narrow window is required. The pitch detector, however,
requires a high resolution spectrum, which suggests that the
window be four to five pitch periods long. As typical pitch
periods vary from about 2.5 to 20 ms, the above requirements
cannot both be met. Fortunately, a certain amount of slurring of
the excitation is allowable and the optimum size is the minimum
width which gives adequate spectral resolution.

Given that a time window is required, which window is the
best? The requirements here are not unusual: minimum width main
lobe and low sidelobes in the frequency domain with minimum width
in the time domain. The exact choice of window is probably not
important as long as one of the standard high quality (i.e.,
narrow main lobe and minimal sidelobes) windows such as the
Hamming window is used.

Observation of high resolution speech spectra will show the
spectra of apparently ideal (time domain) speech to be less than
ideal. The periodicity of the spectrum may break down above one
kilohertz. The voicing periodicity of voiced fricatives may be
obliterated above about the same point by the random noise
components. Finally, a voiced sound may have as much as 50 dB
dynamic range in the spectrum, with the peak energy concentrated
in the region of the first two formants, which lie below about 1.5
kHz. These factors suggest that the region up to about 1.5 kHz
is the most reliable portion of the spectrum for use in pitch
detection.

If a periodic pulse waveform is fed to a homomorphic pitch
detector, several phenomena will be observed. First, the height
of the pitch period peak in the cepstrum will vary with
frequency. This suggests a weighting of the cepstrum such that,
over the region of interest, the height of the peak as a function
of the period is relatively constant, to allow the use of a fixed
voicing threshold. This weighting function should be relatively
smooth to prevent small pitch errors caused by markedly differing
weights on adjacent samples.


A second observation is that the "floor" from which the
pitch period peak rises may not have a constant level. (The
cepstrum may be viewed as vocal tract frequency response
information in the low order terms and a low level noise floor
elsewhere which, if voiced, contains a peak corresponding to the
periodicity of the source. The basic problem is discrimination
of the peak from the noise and frequency response information.)
Not only can this noise floor have a non-zero DC level, but it
may be tilted or, equivalently, have a varying regional DC level
(have low frequency terms in the cepstrum). As these deviations
from ideal vary, no fixed correction can be used to eliminate
them. A good way of processing this noise floor is "local DC
removal", or removal of the low frequency terms from the cepstrum.
Any threshold operation performed on the pitch period peak now
will see just the peak plus noise rather than the peak plus noise
plus its local DC level.
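One simple way to approximate local DC removal is to subtract a centered moving average from each cepstral sample. This particular method, the function name, and the half-width of 5 samples are assumptions made for illustration; the text only calls for removal of the low frequency terms and does not say how it is done.

```python
def remove_local_dc(cepstrum, half_width):
    """Subtract a centered moving average so only local deviations remain."""
    n = len(cepstrum)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        local_mean = sum(cepstrum[lo:hi]) / (hi - lo)
        out.append(cepstrum[i] - local_mean)
    return out


# A tilted floor (0.05 per sample) carrying a peak of height 3 at index 30.
tilted = [0.05 * i + (3.0 if i == 30 else 0.0) for i in range(60)]
flat = remove_local_dc(tilted, half_width=5)
```

Away from the peak and the array edges the tilted floor is reduced to zero, while the peak keeps most of its height, so a fixed threshold sees the peak rather than the peak plus its local DC level.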

A third observation is the presence of cepstral peaks,
usually lower in amplitude than the main peak, at multiples of
the pitch period. These extraneous peaks, if found by the peak
picker instead of the proper peak, will cause a doubling (or
tripling) of the pitch period estimate. The amplitude of these
peaks is a function of the length of the original time window on
the signal. Too long a window will yield high amplitude
extraneous peaks. (For a constant pitch signal, a longer window
will yield narrower log-spectral lines. As the log-spectrum
approaches a periodic impulse train, so will the cepstrum.) A
Hamming window about 4.5 pitch periods long will simultaneously
maintain the log-spectral null depth to maintain the desired peak
height and suppress the extraneous peaks. Since a fixed window
cannot be wide enough for low pitched voices and still avoid
doubling on high pitched voices, the window size must be
adaptive. The window size required by low pitched speakers is
also too wide to preserve short term stationarity in the speech
signal. This loss of stationarity is a compromise which must be
made for such speakers but would cause unnecessary degradation of
performance for higher pitched speakers.

In general, simple picking of the highest cepstral point
within an allowed zone of possible pitch periods and above a
certain threshold does not constitute a sufficiently reliable
pitch detection and measurement scheme. The desired cepstral
peak is subject to amplitude jitter due to the time window vs.
time signal phase, stationarity, and (voiced) signal to (all
other) noise ratio. If the sound is voiced, the dominant
cepstral peak usually indicates the correct pitch, but the peak
amplitude varies. A more accurate scheme than simple peak height
thresholding is required for the voiced-unvoiced decision.

Even with all of the above refinements, a homomorphic pitch
detector still makes occasional errors, especially at voicing
boundaries. Some form of post processing to correct single
errors embedded in correct pitches is helpful. This smoothing,
however, must preserve the voicing boundaries. A linear lowpass
operation will not work (especially if an unvoiced frame is
represented as a pitch of zero). The nonlinear operation of
median smoothing [9] (pick the median of an odd number of the
most current estimates) appears to be well suited to the
application.

Finally, the pitch detector is implemented in the discrete
domain. What sampling rates are required to adequately represent
the signals involved? The original signal $s(n)$ can be assumed to
be Nyquist rate sampled. $S(k)$, the discrete Fourier transform of
$w(n)s(n)$, must have at least as many samples as the time window
$w(n)$ to prevent loss of information. $\log|S(k)|$, however,
represents a sampled distorted $S(f)$. To lessen the aliasing,
$\log|S(k)|$ must have a much higher sample rate than that required
to represent $S(k)$. The worst case that this sampling rate must
meet is the low pitched male speaker. If his fundamental
frequency can reach 50 Hz, the sampling must be close enough to
adequately represent a comb structure with peaks every 50 Hz. It
is not known how dense a sampling is required, but the sampling
used in this implementation (7 Hz) appears to be adequate. It is
possible, however, that an even denser sampling would yield
improved results.
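As a rough check, the quoted spacing can be reproduced from the figures given in Section V (132 µs sampling interval, downsampling by 2, 512 point transform), assuming that the 7 Hz figure is the bin spacing of that transform:

$$
f_s = \frac{1}{132\ \mu\text{s}} \approx 7576\ \text{Hz}, \qquad
\frac{f_s}{2} \approx 3788\ \text{Hz}, \qquad
\Delta f = \frac{3788\ \text{Hz}}{512} \approx 7.4\ \text{Hz}, \qquad
\frac{50\ \text{Hz}}{\Delta f} \approx 6.8
$$

Under that assumption, a comb with peaks every 50 Hz is represented by roughly seven log-spectrum samples per harmonic.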

V.  The  Implementation 

As the best known judge of a pitch detector is the human
ear, development and testing of the pitch detector was carried
out by implementing the pitch detector (Figure 4) as part of a
real-time vocoder. An existing LDSP [1] implementation of an LPC
vocoder was used as the test vehicle. The operator could switch
in real time between the homomorphic pitch detector and the
original Gold-Rabiner [6] pitch detector, while using the same
LPC spectrum analysis and synthesis (20 ms frame interval, 12th
order autocorrelation LPC coded to 3.6 kbit). This allowed A-B
comparison of the homomorphic pitch detector with a known
algorithm.

The original speech signal is lowpassed at 3780 Hz,
preemphasized for the LPC, and 12 bit sampled at a 132 µs interval
(7576 Hz). This sampled signal feeds both the LPC spectral
analyzer and the Gold-Rabiner pitch detector. For input to the
homomorphic pitch detector, the signal is again lowpassed at a
digital frequency of π/2 (1894 Hz) and downsampled by a factor of
2. The downsampled waveform is windowed by a Hamming window of
100 to 250 samples (26 to 66 ms) and padded with sufficient zeros
to fill a 512 point buffer. The buffer is Fourier transformed by
a real FFT and the logarithm of the magnitude of each frequency
point is computed. (Only the positive frequency points need be
computed due to the symmetries inherent in the Fourier
transform.) This log spectrum is Fourier transformed to produce a
downsampled cepstrum. To prevent the spectral window from
spreading the high amplitude low order cepstrum into the pitch
period zone, the low order 2.3 ms of the cepstrum are zeroed
before the cepstrum is filtered to implement the spectral window
(Figure 5). The pitch period peaks of this function are now
approximately leveled by multiplying it by the weighting function
$l(n)$


$$ l(n) = \begin{cases} 1, & n < 21 \\ 1 + .01\,(n - 21), & 21 \le n < 128 \end{cases} \tag{5} $$


and interpolated back to the original sample interval of 132 µs to
create the modified cepstrum (Figure 6). (Periods measured on
this cepstrum now relate directly to the original waveform when
measured in samples.)
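The zeroing and leveling steps can be sketched as follows. The function names are hypothetical, the count of 9 zeroed samples is an assumption (about 2.3 ms at the 264 µs downsampled spacing), and the spectral window filter and the interpolation back to the original rate are left out of this sketch.

```python
LOW_ORDER_SAMPLES = 9  # assumed: about 2.3 ms at the 264 us downsampled spacing


def cepstral_weight(n):
    """Weight l(n) of equation (5) for downsampled cepstral index 0 <= n < 128."""
    if n < 21:
        return 1.0
    return 1.0 + 0.01 * (n - 21)


def level_cepstrum(cepstrum):
    """Zero the low order terms, then weight the first 128 cepstral samples."""
    out = []
    for n, value in enumerate(cepstrum[:128]):
        if n < LOW_ORDER_SAMPLES:
            out.append(0.0)
        else:
            out.append(value * cepstral_weight(n))
    return out


# A flat input makes the weighting itself visible in the output.
leveled = level_cepstrum([1.0] * 128)
```

The weight is continuous at n = 21 and rises to about 2 at the long-period end, so peaks from low pitched voices are raised relative to those from high pitched voices before a single threshold is applied.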




Fig.  5.  Log  spectral  window.  (Implemented  as  a filter  on  the  Cepstrum) 
Impulse  response:  f(0)  = .52203,  f (1)  = f(-l)  = .23590,  f (-2)  -.24475, 
f (3)  = f (-3)  = .22653,  f(4)  = f(-4)  = -.02637. 


Fig. 6. Waveforms of the homomorphic pitch detector. Input: sustained vowel /a/. Panels: log spectrum (0-1.1 kHz); modified cepstrum (0-20 msec).


A modified peak picker is now applied to the cepstrum. The
picker selects the highest point in a window (19 to 152 samples,
corresponding to pitch periods of 2.5 to 20 ms) and, if the
height of the point is above a given threshold and the point is
not the lowest order point in the window, chooses this point as
the pitch estimate. If the highest point is at the bottom edge
of the window, it is assumed to be the side of a higher spectral
information peak outside of the window. The threshold (see Table
1) is intentionally low so that any doubtful frames are initially
chosen to be voiced. If a pitch estimate is not found, the frame
is declared to be unvoiced, which is signaled by a pitch estimate
of zero.
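A minimal sketch of this peak picker is given below. The function name and the toy cepstra are hypothetical; the window limits and the threshold of 8 follow the values quoted in the text and in Table 1, and "above" the threshold is taken as strictly greater, which is an assumption.

```python
def pick_peak(cepstrum, lo=19, hi=152, threshold=8):
    """Return (period, height); a period of 0 signals an unvoiced frame."""
    peak = max(range(lo, hi + 1), key=lambda n: cepstrum[n])
    height = cepstrum[peak]
    # Reject weak peaks, and peaks on the bottom edge of the window, which are
    # taken to be the skirt of spectral envelope information below the window.
    if height <= threshold or peak == lo:
        return 0, height
    return peak, height


clear = [0.0] * 160
clear[60] = 20.0           # strong peak inside the window
edge = [0.0] * 160
edge[19] = 30.0            # highest point sits on the bottom edge
weak = [0.0] * 160
weak[60] = 5.0             # peak below the threshold

voiced = pick_peak(clear)
rejected_edge = pick_peak(edge)
rejected_weak = pick_peak(weak)
```

The height is returned alongside the period because the later voicing decision compares it against a continuity dependent threshold.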


As the pitch of voiced speech usually varies smoothly,
continuity of the pitch estimates can be used to improve the
reliability of the voiced-unvoiced decision. The current frame
estimate is given a coincidence score equal to the number of
times that its pitch is within 8 samples (1.1 ms) of the pitch of
the two adjacent frames. (The pitches of the adjacent frames are
the raw outputs of the peak picker rather than an additionally
processed estimate to prevent potentially unstable feedback
loops.) This score is now used to choose which of three
thresholds is to be compared to the current cepstral pitch peak
height to decide whether the current frame is voiced. These
thresholds, in arbitrary units, are:


    coincidence score    peak height threshold
            0                     18
            1                     11
            2                      8
       peak picker                 8

    Table 1. Voicing decision thresholds


(The units are a function of the implementation choice of the
base of the logarithm, the binary points of the logarithm, the
spectral window, the gain constant in front of the inverse FFT,
and the cepstral weighting function.)
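The coincidence scoring and threshold selection can be sketched as follows. The function names are hypothetical; "within 8 samples" is taken to include a difference of exactly 8, and "above" a threshold is taken as strictly greater, both of which are assumptions.

```python
THRESHOLDS = {0: 18, 1: 11, 2: 8}  # Table 1, arbitrary units


def coincidence_score(prev_raw, current, next_raw, tolerance=8):
    """Count adjacent raw estimates lying within `tolerance` samples of current.

    An unvoiced neighbour (pitch 0) never coincides, because valid pitch
    periods start at 19 samples.
    """
    return sum(1 for neighbour in (prev_raw, next_raw)
               if abs(current - neighbour) <= tolerance)


def is_voiced(prev_raw, current, next_raw, peak_height):
    """Voicing decision for the current frame from its raw peak-picker output."""
    if current == 0:
        return False
    score = coincidence_score(prev_raw, current, next_raw)
    return peak_height > THRESHOLDS[score]


supported = is_voiced(60, 62, 65, peak_height=9)    # score 2, threshold 8
isolated = is_voiced(0, 62, 0, peak_height=12)      # score 0, threshold 18
one_sided = is_voiced(0, 62, 64, peak_height=12)    # score 1, threshold 11
```

A weak peak is therefore accepted when both neighbours agree with it, while an isolated estimate needs a much higher peak to be declared voiced.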

A similar scheme which used the two previous and two
following pitch estimates was also tried. Comparison of the
current estimate and this set of four yields ten ordered
coincidence groups if mirror images are considered identical.
Each of these coincidence groups had an associated threshold
which was compared with the cepstral peak height of the current
frame to make the voiced-unvoiced decision. This scheme yielded
only a slight improvement over the simpler scheme and was judged
not worth the additional complexity.


The  pitch  estimates  are  now  fed  through  a third  order  median 
smoother  [9].  This  operation  will  remove  single  errors  without 
shifting  the  voicing  boundaries.  The  median  smoother  also 
frequently  corrects  erroneous  pitch  estimates  at  voice  onset  and 
termination  and  generally  yields  smoother  sounding  speech  by 
removing some of the jitter on the pitch track. (Median
smoothers will handle unvoiced frames correctly if they are
represented by a pitch period of zero.)
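A third order median smoother can be sketched in a few lines. The function name and the example track are hypothetical, and the sketch works on a stored track with a centered window, which gives the same values as a median over the most recent three estimates delayed by one frame.

```python
def median_smooth(track, order=3):
    """Median-smooth a pitch track; 0 marks an unvoiced frame."""
    half = order // 2
    smoothed = list(track)  # the first and last `half` frames pass through
    for i in range(half, len(track) - half):
        window = sorted(track[i - half:i + half + 1])
        smoothed[i] = window[half]
    return smoothed


# Pitch periods in samples, with a single doubling error (120) in a voiced run.
raw = [0, 0, 60, 61, 120, 62, 63, 64, 0, 0]
smoothed = median_smooth(raw)
```

In this example the doubled estimate is replaced by a neighbouring value, and the first and last voiced frames stay where they were, which a linear lowpass filter would not guarantee.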

The  output  of  the  median  smoother  is  used  in  two  ways:  it 
is  passed  to  the  synthesizer  as  the  best  estimate  of  the  pitch 
and  fed  to  the  adaptive  time  window  size  routine.  The  next 
window  size  is  computed  as  follows: 


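$$ \mathrm{size}(n+1) = \begin{cases} \mathrm{size}(n), & p = 0 \\ 2.25\,(.9\,\mathrm{size}(n) + .1\,\mathrm{pitch}), & p > 0 \end{cases} \tag{6a} $$

$$ \text{limits:} \quad 26.4\ \text{ms} \le \mathrm{size} \le 66.0\ \text{ms} \tag{6b} $$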

(As the window is in the downsampled domain, its true size is 4.5
average pitch periods.) Due to the delays in the voiced-unvoiced
decision and the median smoother, the pitch is delayed several
frames and therefore the window size is changed several frames
late. As the window size need not track the pitch accurately,
this delay appears to cause no degradation of the results.
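The window size update can be sketched as below, under a stated assumption: the smoothing is read as 0.9 size(n) + 0.1 (2.25 pitch), so that the size settles at 2.25 times the pitch period in downsampled samples, which agrees with the remark that the true window size is 4.5 average pitch periods. The function and constant names are hypothetical, and the limits are expressed as 100 and 250 downsampled samples (26.4 and 66.0 ms at a 264 µs spacing).

```python
MIN_SIZE, MAX_SIZE = 100, 250  # downsampled samples: 26.4 ms and 66.0 ms


def next_window_size(size, pitch):
    """Update the window size from the smoothed pitch (original-rate samples).

    A pitch of 0 means the frame is unvoiced and the size is held.
    """
    if pitch == 0:
        return size
    # Assumed reading: exponential smoothing toward 2.25 * pitch.
    updated = 0.9 * size + 0.1 * (2.25 * pitch)
    return min(MAX_SIZE, max(MIN_SIZE, updated))


def settle(size, pitch, frames=300):
    """Apply the update repeatedly with a constant pitch."""
    for _ in range(frames):
        size = next_window_size(size, pitch)
    return size


mid_voice = settle(150.0, 80)     # 80 samples = 10.56 ms period
high_voice = settle(150.0, 20)    # short periods hit the lower limit
low_voice = settle(150.0, 152)    # long periods hit the upper limit
```

With a time constant of about ten voiced frames, the window follows the average pitch rather than individual estimates, which fits the observation that the size need not track the pitch accurately.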

The above operations are all performed with 16 bit
arithmetic. The data for the FFTs are stored in block floating
point with right shifts only as required to prevent overflow.

VI.  Results 

No clearly defined objective method for the testing of pitch
detectors exists. Errors can be of several forms: small pitch
errors, gross pitch errors, and voicing decision errors. The
perceptual significance of each of the errors is a function of
the listener, the speaker, the spectral analysis-synthesis
algorithm, and where in the speech each error occurs. Therefore,
many investigators (including this one) fall back on subjective
judgements by trained listeners who can frequently classify the
type of error as well as its presence. As the pitch detectors
examined here are implemented in a real-time vocoder with
real-time displays of the windowed speech, the log-spectrum, the
cepstrum, and the pitch track, these observations of performance
are based on hours of listening time by several trained listeners
who could simultaneously observe the internal workings of the
pitch detector as they listened.

On clear speech, the homomorphic pitch detector and the
Gold-Rabiner algorithm perform similarly for male speakers. For
female speakers, the homomorphic algorithm makes fewer errors
than the Gold-Rabiner algorithm. Both pitch detectors make only
occasional errors, which are likely to be perceived as glitches in
the speech. In silent intervals, the homomorphic pitch detector
occasionally finds the pitch of the background 60 Hz power line
hum. This, however, is of no perceptual significance as the
energy of the synthesized output is too low for the output to be
audible. Another characteristic error of the homomorphic pitch
detector is occasional "squeaks" caused by spectral envelope
information appearing in the pitch zone of the cepstrum being
analyzed as a high pitch. The homomorphic algorithm also
determines voiced fricatives to be voiced, which appears to be a
perceptually appropriate decision.

The differences between the pitch detectors become much more
obvious when applied to corrupted speech. Both pitch detectors
were tested on speech corrupted with additive noise
characteristic of the interior of a large jet aircraft [11].
This noise exhibits a broad spectral peak below about 600 Hz. At
a signal-to-noise ratio of about 10 dB, the noise degrades the
Gold-Rabiner algorithm more than the homomorphic pitch algorithm.
The perceptual form of the errors is quite different for the two
pitch detectors. The Gold-Rabiner pitch detector jumps in and
out of voicing at a high enough rate to chop up the speech.
(Completely removing the pitch detector and declaring all frames
unvoiced would probably be more intelligible.) The homomorphic
pitch detector gives fairly good pitch estimates except for
occasional zones where it devoices. These zones tend to be of
syllabic duration and are perceived as a devoiced syllable, and
therefore appear to do less damage than the chopping to the
intelligibility of the speech. The homomorphic pitch detector
even yields reasonably correct analyses when the signal-to-noise
ratio is so low that the output of the vocoder is unintelligible
due to errors in the LPC spectrum analysis.

Comparisons of the two pitch detectors were made with
narrowband additive noise. The noise used here is a 100 Hz sine
wave. At a signal-to-noise ratio of about 0 dB, the homomorphic
pitch detector occasionally finds the pitch of the noise during
speech silences but is otherwise unaffected. Under the same
conditions, the Gold-Rabiner pitch detector yields badly "chopped
up" pitch. At a signal-to-noise ratio of about 10 dB the
homomorphic pitch detector is essentially unaffected while the
Gold-Rabiner pitch detector still yields badly "chopped up"
pitch. As the signal-to-noise ratio increases to about 30 or 40
dB, this "choppiness" gradually decreases and vanishes. (The
Gold-Rabiner pitch detector should find the pitch of the sine
wave during the silences as it is a direct waveform measurement
type of pitch detector. This effect, which is distinct from the
"choppiness", ceases at signal-to-noise ratios above about 30
or 40 dB, where the sine wave drops below an energy threshold.)

Comparison of the pitch detectors on telephone degraded
speech also indicated differences in the performance of the two
pitch detectors. The telephone simulator [10] which was used for
the tests has two sets of parameters: "mid", representing a 50th
percentile continental US long-distance voice-grade line, and
"poor", representing a 90th percentile continental US
long-distance voice-grade line. Both settings attempt to simulate
the bandpassing, Gaussian and pulse noise, phase distortion,
frequency distortion, and nonlinearity of the respective
telephone lines. The "mid" telephone line causes essentially no
degradation of the homomorphic pitch detector but causes some
annoying oscillation in and out of voicing by the Gold-Rabiner
pitch detector. On the homomorphic pitch detector, the "poor"
telephone line causes some devoicing of syllabic duration and a
few "squeaks", neither of which seriously impairs intelligibility.
The Gold-Rabiner algorithm exhibits severe devoicing and rapid
oscillation in and out of voicing, which cause severe damage to
the intelligibility of the speech. As the telephone simulator
high-pass filtered the speech with a cutoff of about 300 Hz,
significant amounts of the information used by both pitch
detectors were removed. Versions of the homomorphic pitch
detector which used a spectral window which deemphasized the low
frequency region exhibited less degradation due to the telephone
simulator.

VII.  Discussion 

Clear speech allows a pitch detector many design options.
Direct waveform processing and measurements are possible. These
techniques, which allow the pitch detector to analyze
nonstationary voicing almost as effectively as stationary
voicing, degrade in the presence of noise. Waveform pitch
detectors can no longer accurately locate peaks and zero
crossings, which may be obscured by additive noise. Distortion
for spectral leveling [7], a commonly used preprocessing
technique for correlation type pitch detectors, now creates
interfering cross terms between the noise and the speech.


Pitch  detection  in  the  presence  cf  ncise  reguires  the  use  of 
tha  coherence  found  in  the  voiced  excitaticn  to  differentiate  the 
signal  from  the  noise.  Pitch  detectors  which  are  robust  with 
respect  to  input  speech  degradation  therefore  must  yield  some  of 
their  clear  speech  performance  on  nonstaticnary  voicing.  The 
homomorphic  pitch  detector  attempts  to  exploit  this  coherence  in 


26 


several  ways  tc  achieve  its  robustness.  Generation  cf  the 
complex  spectrum  exploits  (and  requires)  the  coherence  of  the 
voiced  excitation.  Taking  the  magnitude  of  this  complex  spectrum 
maximizes  the  phase  coherence  cf  its  periodic  line  structure. 

The logarithm, in conjunction with the original time window
(which sets the line shape and width), maps the line
structure of the magnitude spectrum into a relatively constant
amplitude line structure plus some slowly varying terms. (This
log magnitude operation degrades gracefully in the presence of
noise: the nulls of the log spectrum become "filled in" by
the noise.) The second FFT then exploits the coherence of this
constant amplitude periodic line structure to generate the
cepstral peak which indicates the presence and pitch of voicing.
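
The chain described above (time window, magnitude, logarithm, second
transform, peak search) can be sketched as follows. This is only a minimal
illustration under stated assumptions, not the implementation of Section V:
a Hamming window stands in for the adaptive time window, the spectral window
is omitted, a naive DFT replaces the FFT so the sketch needs no external
libraries, and the names dft, cepstrum and cepstral_peak are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT; slow but dependency-free, adequate for one short frame."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]

def cepstrum(frame):
    """Time window -> magnitude spectrum -> logarithm -> second transform."""
    n = len(frame)
    # Assumed Hamming time window (the report uses an adaptive window).
    window = [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]
    spectrum = dft([w * s for w, s in zip(window, frame)])
    log_mag = [math.log(abs(c) + 1e-12) for c in spectrum]  # floor avoids log(0)
    return [c.real / n for c in dft(log_mag)]

def cepstral_peak(frame, min_lag, max_lag):
    """Return (lag, height) of the largest cepstral value in [min_lag, max_lag]."""
    c = cepstrum(frame)
    lag = max(range(min_lag, max_lag + 1), key=lambda q: c[q])
    return lag, c[lag]

# Idealized voiced excitation: one pulse every 32 samples in a 256-sample frame.
PERIOD = 32
frame = [1.0 if k % PERIOD == 0 else 0.0 for k in range(256)]
lag, height = cepstral_peak(frame, 16, 100)
print(lag, height)
```

For this perfectly periodic pulse train the log spectrum is an exact comb, so
the search returns the pulse spacing as the peak lag. Real speech adds the
slowly varying vocal tract terms and noise discussed in the text.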

The majority of the homomorphic pitch detector errors fall
into two classes. First, the pitch is sometimes estimated incorrectly
at voicing onset or termination. Nonstationarity in the voicing
in both pitch and amplitude tends to concentrate at these points,
making them obvious trouble spots for a coherent pitch detector.
(A possible scheme to improve the pitch detector's tolerance to
voicing nonstationarity is outlined in Appendix A.)

The second difficulty is the voiced-unvoiced decision. If
the decision is heavily biased toward voicing so that few voiced
to unvoiced errors occur, the pitch estimates generally appear
accurate. The voiced-unvoiced decision is based on a pitch
continuity dependent threshold placed on the cepstral peak
height. What, then, affects this parameter? The height of the
peak is the strength of a frequency component in the windowed log
spectrum. The strength of this component is a function of the
periodic signal-to-noise ratio. (Noise "fills in" the nulls of
the comb spectrum and reduces the amplitude of the component.)
Varying frequency and amplitude of the voiced excitation reduce
the coherence of the voicing and spread and reduce the height of
the peaks of the comb spectrum. The vocal tract filter (or audio
channel) can have differing delays for the different harmonics of
the source which, if the pitch is changing, will spread and lower
the cepstral peak. The phase effects of the vocal tract filter
and channel are of no consequence since the log spectrum is phase
insensitive, except that changing phase shifts in the tract and
channel will shift the frequencies of the harmonics and degrade
the cepstral peak. Clearly a more accurate measure of voicing
signal-to-noise ratio is desirable.
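
A pitch continuity dependent threshold of the kind mentioned above might look
like the following sketch. The rule and every number in it are assumptions
made for illustration (a lower threshold when the candidate lag is close to
the previous frame's lag); the function name voicing_decision and its
parameters are hypothetical and do not come from the report.

```python
def voicing_decision(peak_height, peak_lag, prev_lag,
                     base_threshold=1.0, continuity_discount=0.5, tolerance=0.15):
    """Return True if the frame is declared voiced.

    peak_height, peak_lag: height and lag of the cepstral peak for this frame.
    prev_lag: lag accepted for the previous frame, or None if it was unvoiced.
    All default values are illustrative assumptions, not values from the report.
    """
    threshold = base_threshold
    # Assumed rule: a peak that continues the previous pitch track is trusted
    # more, so it only has to clear a reduced threshold.
    if prev_lag is not None and abs(peak_lag - prev_lag) <= tolerance * prev_lag:
        threshold *= continuity_discount
    return peak_height >= threshold

continuing = voicing_decision(0.7, 80, 78)    # close to the previous lag
isolated = voicing_decision(0.7, 80, None)    # no previous pitch to continue
jumping = voicing_decision(0.7, 80, 40)       # far from the previous lag
```

With these assumed settings the same peak height is accepted when it continues
an existing pitch track and rejected otherwise, which is one way to bias the
decision toward voicing inside voiced regions without raising the false
voicing rate elsewhere.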

The spectral window remains the most mysterious part of the
pitch detector and a possible target for future development.
(The window design is a simultaneous dual domain design problem.)
The best windows that were found tended to be quite different
from those that one would expect to be optimum. The best so far
appears to have a single broad peak with an upper cutoff at about
1 to 1.2 kHz (Figure 5). Tests of the classic rectangle with
smoothed discontinuities give poor results for no apparent
reason. Tailoring of the window to specific applications appears
to be possible where the signal-to-noise ratio varies over the
spectrum. Specific attempts have not been made to design a
window for either of the corruptions mentioned earlier, but two
windows which give similar performance on clear speech give
varying performance on the corrupted speech.

Some of the earlier work on the homomorphic algorithm
suggests the possibility of integration of the spectral analysis
and the pitch detector [5,6]. This investigation suggests that
performance may suffer as a result of such a sharing of
processing. Succeeding work has indicated that performance of the
spectral estimator over a wide range of pitches requires an
adaptive time window which is much smaller (about 3 pitch
periods) than the window required for the pitch detector. This
would prevent the sharing of the initial FFT as well as any of
the succeeding processing.

VIII.  Summary 

The homomorphic pitch detector has existed for quite some
time in a comparatively undeveloped state due to the difficulties
of real-time implementation. This investigation reveals that,
with suitable modification, it is capable of clear speech
performance rivaling one of the better pitch detectors in current
use. Tests on two forms of corrupted speech indicate the
homomorphic algorithm is also more robust with respect to
corruption of the speech than is the Gold-Rabiner algorithm,
which allows the effective use of a vocoder in a less than ideal
environment. As all of the operations required to achieve this
performance can likely be implemented in real-time using CCD
technology and microprocessors, this pitch detector can probably
be built with fairly simple hardware.




IX.  Appendix


The homomorphic pitch detector is one of a class of
coherence seeking pitch detectors. These pitch detectors
generally use signal processing techniques to process the speech
waveform into a form more suitable for pitch detection and
estimation. They tend, however, to degrade when the speech
deviates from the ideal model in any of several ways. This
appendix describes a pilot study of a set of preprocessors for
improving the speech generation model error tolerance of
coherence seeking pitch detectors.

Basically, the homomorphic pitch detector uses homomorphic
techniques for processing the speech signal into a waveform which
contains a peak to indicate the presence and pitch of voicing.
The problem then becomes detection of the peak's location and a
decision based on its height in the face of noise on the
waveform.

A large number of factors, as enumerated earlier in this
report, lessen the peak height and reduce its discriminability.
One of these factors is nonstationarity in the voiced excitation.
This nonstationarity can appear in two forms: amplitude and
pitch. The amplitude nonstationarity is most severe at voicing
boundaries where the perceptual consequences of an error are
minimal. Pitch nonstationarity, while frequently severe for a
few pitch periods at voicing boundaries, occurs throughout
speech. It may be possible to modify the homomorphic pitch
detector to provide a greater tolerance to these violations of
the analysis model.

One means of increasing tolerance to a particular model
error is to postulate the error and modify the algorithm such
that it can function in the presence of the error (even if it
fails without the error). A mapping of the speech signal to
regenerate a violated assumption is one such means. If the
original signal is increasing in amplitude, multiply it by a
decaying function (such as a decaying exponential) to restore its
amplitude stationarity. If the pitch period of the voiced
excitation is changing, time warp the waveform to return it to a
constant periodicity signal.
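
The amplitude correction can be illustrated with a short sketch. It assumes,
purely for illustration, that the amplitude grows as exp(rate * n) with a
known rate; in practice the rate is not known in advance. The function name
amplitude_correct and the test signal are hypothetical.

```python
import math

def amplitude_correct(samples, rate):
    """Multiply by exp(-rate * n) to undo an assumed exp(+rate * n) growth."""
    return [x * math.exp(-rate * n) for n, x in enumerate(samples)]

# Illustrative test signal: a sinusoid of period 20 samples whose amplitude
# grows exponentially across the frame (values chosen only for the example).
PERIOD = 20
GROWTH = 0.01
growing = [math.exp(GROWTH * n) * math.sin(2 * math.pi * n / PERIOD)
           for n in range(200)]

flat = amplitude_correct(growing, GROWTH)

# Largest magnitude within each period, before and after the correction.
peaks_before = [max(abs(v) for v in growing[i:i + PERIOD]) for i in range(0, 200, PERIOD)]
peaks_after = [max(abs(v) for v in flat[i:i + PERIOD]) for i in range(0, 200, PERIOD)]
```

After correction every period has the same peak magnitude, which is the
amplitude stationarity the analysis model assumes. A mismatched rate would
leave a residual growth or decay.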

No models exist which give a functional description of pitch
changes. The pitch, however, generally varies smoothly and not
too rapidly. Postulate a periodic signal:


    f(t) = f(t + p)                                      (A1)

    t = time
    p = period

Time warp f to create a signal of varying period:

    g(t) = f(s(t))                                       (A2)

    s = warp function

    P_g(t) = \frac{dt}{ds} \, p                          (A3)

    P_g = period of g

Assume a linear pitch change:

    P_g(t) = (1 + at) \, p                               (A4)

To recover f(s) = g(t(s)):

    ds = \frac{p}{P_g} \, dt                             (A5)

    s(t) = \int_0^t \frac{1}{1 + a\tau} \, d\tau         (A6)

         = \frac{1}{a} \log(1 + at)                      (A7)

    t(s) = \frac{1}{a} \left( e^{as} - 1 \right)         (A8)

    t(s) = correcting warp

         \approx s + \frac{a}{2} s^2                     (A9)

    for small as

Thus a linear time warping of a linearly varying pitch signal
will approximately map the signal into a constant pitch signal.
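
The warp relations can be checked numerically. The sketch below is only an
illustration: the period P, the rate A and the sinusoidal f are assumed
values chosen for the example, and the function names are hypothetical. It
evaluates (A7), (A8) and (A9) and confirms that g(t(s)) reproduces f(s).

```python
import math

def s_of_t(t, a):
    """Forward warp (A7): s(t) = (1/a) log(1 + a t)."""
    return math.log1p(a * t) / a

def t_of_s(s, a):
    """Correcting warp (A8): t(s) = (1/a)(exp(a s) - 1)."""
    return math.expm1(a * s) / a

def t_of_s_quadratic(s, a):
    """Small-(a s) approximation (A9): t(s) ~ s + (a/2) s^2."""
    return s + 0.5 * a * s * s

P = 0.008   # assumed constant period of f, in seconds
A = 2.0     # assumed rate of period change, in 1/seconds

def f(s):
    """Periodic signal of (A1); a sinusoid is used purely as an example."""
    return math.sin(2 * math.pi * s / P)

def g(t):
    """Signal of varying period (A2): g(t) = f(s(t))."""
    return f(s_of_t(t, A))

grid = [0.0005 * k for k in range(61)]   # 0 to 30 ms in 0.5 ms steps
recover_error = max(abs(g(t_of_s(s, A)) - f(s)) for s in grid)
quadratic_error = max(abs(t_of_s(s, A) - t_of_s_quadratic(s, A)) for s in grid)
print(recover_error, quadratic_error)
```

Under these assumed values the exact correcting warp recovers f to within
rounding error, and the approximation (A9) departs from (A8) by a small
fraction of one period over a 30 ms span.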

Non-real-time simulations of a time warp feeding the
homomorphic analyzer have been performed on speech. The time
warp was implemented with a 51 section lowpass filter to
interpolate the input waveform. The cepstrum (c(n)) as computed
by the procedure of Figure 4 was considered the final output from
which judgements of the pitch peak enhancement were made. As
illustrated in Figures A1 and A2, instances of peak enhancement
were found in a short section of speech which was searched.
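
One way to realize such an interpolating time warp on sampled data is
sketched below. The report gives only the filter length, so the
Hamming-tapered sinc kernel, the function names and the synthetic test signal
are all assumptions; the warp positions follow (A8) with time measured in
samples.

```python
import math

TAPS = 51            # filter length quoted in the text
HALF = TAPS // 2

def interpolate(x, pos):
    """Estimate x at fractional index pos with an assumed tapered-sinc kernel."""
    center = int(round(pos))
    total = 0.0
    for k in range(center - HALF, center + HALF + 1):
        if 0 <= k < len(x):
            d = pos - k
            sinc = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
            taper = 0.54 + 0.46 * math.cos(math.pi * d / (HALF + 1))
            total += x[k] * sinc * taper
    return total

def time_warp(x, a):
    """Resample x at t(s) = (exp(a s) - 1) / a for s = 0, 1, 2, ... (samples)."""
    out = []
    s = 0
    while math.expm1(a * s) / a <= len(x) - 1:
        out.append(interpolate(x, math.expm1(a * s) / a))
        s += 1
    return out

PERIOD = 40.0        # assumed period after correction, in samples
A = 0.0005           # assumed warp rate per sample

# Synthetic input whose period grows as (1 + A n) * PERIOD, as in (A4).
chirp = [math.sin(2 * math.pi * math.log1p(A * n) / (A * PERIOD))
         for n in range(600)]
flat = time_warp(chirp, A)

# Compare with a constant-period sinusoid, away from the frame edges where
# the kernel is truncated.
worst = max(abs(flat[s] - math.sin(2 * math.pi * s / PERIOD))
            for s in range(30, len(flat) - 30))
```

Away from the frame edges the warped output follows a sinusoid of constant
period, which is the condition the homomorphic analyzer assumes.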

(The positions of the pitch peaks are skewed as a function
of a because t=0, i.e., the point of zero warp, was at the left edge
of the Hamming window. In a real implementation, this effect
would be corrected so that the point of zero warp would be the
center of the window, thus removing the dependence of the peak
position on a.)




A priori, one cannot know whether time warping (or amplitude
correction) will enhance the pitch period peak in the cepstrum.
An implementation using any of these techniques would therefore
be required to run one full pitch detector per preprocessor and
base its final decision on the outputs (or intermediate results)
of all. As specific attempts will then have been made to correct
for deficiencies of the basic pitch detector, the voicing decision
thresholds of the individual pitch detectors might be raised to
lessen the probability of a false voicing detection.
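
The decision stage of such a bank of detectors could be sketched as follows.
The report does not specify how the outputs are to be combined, so taking the
tallest cepstral peak is an assumed rule, and the function name combine, the
labels and the numbers are made up for illustration only.

```python
def combine(candidates, threshold):
    """Pick the candidate with the tallest cepstral peak, or None if unvoiced.

    candidates: list of (label, lag_in_samples, peak_height) tuples, one per
    preprocessor and pitch detector pair. Choosing the maximum is an assumed
    combination rule, not one given in the report.
    """
    label, lag, height = max(candidates, key=lambda c: c[2])
    if height < threshold:
        return None          # no detector cleared the (raised) threshold
    return label, lag

# Made-up illustrative numbers, not measurements from the report.
frame_results = [
    ("no warp", 81, 0.9),
    ("warp a=+2", 80, 1.6),
    ("warp a=-2", 83, 0.7),
]

decision = combine(frame_results, threshold=1.2)
rejected = combine(frame_results, threshold=2.0)
```

In this example the threshold of 1.2 exceeds the unwarped peak height, so the
frame is declared voiced only because one of the preprocessors enhanced the
peak, which is the behaviour the paragraph above anticipates.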

Such a pitch detector might give improved performance in
several ways. As most of the clear speech errors of the
homomorphic pitch detector occur at voicing boundaries (which are
the regions of highest nonstationarity), the technique might
prevent some of these errors. In fact, Figures A1 and A2 and
most other zones where the technique was found to enhance the
pitch peak were at or near voicing boundaries.

The pitch detector might also become more robust. If the
speech is degraded by additive noise, the pitch peak height would
be reduced. The major perceived error in airborne command post
noise is devoicing for occasional sections of approximately
syllabic length. If this is a cumulative effect of the noise and
nonstationarity in the original speech reducing the peak height,
the preprocessor might be able to sufficiently correct the
nonstationarity to allow pitch detection.


Both of the preprocessing techniques postulated here are
based on the assumption that the voicing is nonstationary in some
way or combination of ways and attempt to provide a "more
stationary" signal for the pitch detector. These preprocessors
therefore might be useful to any pitch detector which searches
for a coherent periodic component in the speech signal.




X.  Bibliography 


1.  The Lincoln Digital Signal Processor is an advanced version of the
LDVT. See P. E. Blankenship et al., "The Lincoln Digital Voice
Terminal System," Technical Note 1975-53, Lincoln Laboratory, M.I.T.
(25 August 1975), DDC AD-A017569/5; or P. E. Blankenship, "LDVT:
High Performance Minicomputer for Real-Time Speech Processing,"
EASCON '75, pp. 214a-214g.

2.  J. L. Flanagan, Speech Analysis Synthesis and Perception
(Springer-Verlag, New York, 1972).

3.  E. M. Hofstetter et al., "Vocoder Implementations on the Lincoln
Digital Voice Terminal," EASCON '75, pp. 32a-32j.

4.  P. Lieberman, J. Acoust. Soc. Am. 33, 597-603 (1961).

5.  A. M. Noll, J. Acoust. Soc. Am. 41, 293-302 (1964).

6.  A. V. Oppenheim and R. W. Schafer, IEEE Trans. Audio Electroacoust.
AU-16, 221-226 (1968), DDC AD-678238.

7.  L. R. Rabiner, IEEE Trans. Acoust., Speech, and Signal Processing
ASSP-25, 24-33 (1977).

8.  L. R. Rabiner and B. Gold, Theory and Application of Digital Signal
Processing (Prentice-Hall, Englewood Cliffs, New Jersey, 1975).

9.  L. R. Rabiner, M. R. Sambur, and C. E. Schmidt, IEEE Trans. Acoust.,
Speech, and Signal Processing ASSP-23 (1975).

10.  S. Seneff, "A Real-Time Digital Telephone Simulation on the Lincoln
Digital Voice Terminal," Technical Note 1975-65, Lincoln Laboratory,
M.I.T. (30 December 1975), DDC AD-A021409/8.

11.  Speech tapes containing noise characteristic of an airborne command
post were supplied by the Defense Communications Agency.







UNCLASSIFIED

SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1.  REPORT NUMBER: ESD-TR-78-252
2.  GOVT ACCESSION NO.:
3.  RECIPIENT'S CATALOG NUMBER:
4.  TITLE (and Subtitle): Homomorphic Pitch Detection
5.  TYPE OF REPORT & PERIOD COVERED: Technical Note
6.  PERFORMING ORG. REPORT NUMBER: Technical Note 1978-32
7.  AUTHOR(S): Douglas B. Paul
8.  CONTRACT OR GRANT NUMBER(S): F19628-78-C-0002
9.  PERFORMING ORGANIZATION NAME AND ADDRESS: Lincoln Laboratory, M.I.T.,
    P.O. Box 73, Lexington, MA 02173
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
    Program Element No. 62706E; Project Code 8P10; ARPA Order 2006
11. CONTROLLING OFFICE NAME AND ADDRESS: Defense Advanced Research
    Projects Agency, 1400 Wilson Boulevard, Arlington, VA 22209
12. REPORT DATE: 15 August 1978
13. NUMBER OF PAGES: 46
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling
    Office): Electronic Systems Division, Hanscom AFB, Bedford, MA 01731
15. SECURITY CLASS. (of this report): Unclassified
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE:
16. DISTRIBUTION STATEMENT (of this Report): Approved for public
    release; distribution unlimited.
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if
    different from Report):
18. SUPPLEMENTARY NOTES: None
19. KEY WORDS: homomorphic pitch detector; real-time LPC vocoder;
    broadband noise; narrowband noise
20. ABSTRACT:

This note describes a homomorphic pitch detector which yields good
performance on clear speech and moderate robustness to additive
broadband noise, narrowband noise, and to degradation by a telephone
simulator. It achieves the performance by the use of an adaptive time
window, log-spectral windowing, an adaptive voicing threshold, and
pitch track smoothing. It has been implemented in a real-time LPC
vocoder for testing. Finally, a pilot study of a preprocessor to
improve the performance of any coherence seeking pitch detector is
presented.


